Violence on AskMeuf and AskMec

Document produced on 28/01/2026

1 Summary of previous episodes and new dilemmas

I had put together an incomplete lexicon, which is now somewhat more complete: it includes plural nouns and additional verb forms. I did not keep every term that was proposed: for example, coup and coups returned far too many posts because of “du coup”, “coup d’un soir”, “boire des coups”… Including spelling mistakes turned out to be useless and returned no occurrences: long live autocorrect! I tried to include the lexicon of other forms of violence from the list you gave me: since I search and select only among the posts classified as relationship posts, some of these terms have no occurrences at all, in particular those for digital violence (with the exception of revenge porn, which has only one occurrence in the posts and 30 in the comments).
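The “du coup” problem can be illustrated with a quick sketch (the texts are toy examples, not data from the corpus). A Perl look-behind excludes that one idiom, but other idiomatic uses such as “boire des coups” still slip through, which is why the term was dropped rather than patched:

```r
# Toy illustration of why "coup"/"coups" was excluded from the lexicon
texts <- c(
  "du coup je suis parti",   # idiom, not violence
  "il m'a donne un coup",    # potentially relevant
  "on va boire des coups"    # idiom, not violence
)

# Exclude "du coup" with a negative look-behind (perl = TRUE)
hits <- grepl("(?<!du )\\bcoups?\\b", texts, perl = TRUE, ignore.case = TRUE)
hits  # FALSE TRUE TRUE -- the idiom "boire des coups" is still matched
```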

Including these new terms drastically increased the sample size, which I detail below. This makes me wonder how much noise there is: for example, the words colère and énervé are very common when talking about relationships, without necessarily signalling a situation of violence. I think there are still trade-offs to make, perhaps by relying on the most common terms to reduce the noise a little. The same goes for battre, which can appear in “me battre pour lui” (“fight for him”). There is still the option of building a lexicon of these “weak signal” terms and selecting the posts that contain them only if they also contain another term from the lexicon, or if some of their comments do… I find it hard to get the necessary distance for now. On the one hand, it is interesting to objectify how pervasive the vocabulary of violence is in discussions about relationships. On the other, if the goal is to select accounts of actual situations of violence, this approach is probably unsatisfactory.
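A minimal sketch of the “weak signal” option mentioned above (the term lists and the `keep_post` helper are illustrative, not the pipeline’s actual code): a post matching only a weak term is kept only when a strong lexicon term also appears, either in the post itself or in one of its comments.

```r
# Illustrative term lists -- not the full lexicon
weak   <- c("colere", "enerve", "battre")
strong <- c("violence", "frappe", "harcelement")

weak_pat   <- paste0("\\b(", paste(weak,   collapse = "|"), ")\\b")
strong_pat <- paste0("\\b(", paste(strong, collapse = "|"), ")\\b")

# Keep a post if it has a strong term itself, or a weak term
# backed by a strong term in its comments
keep_post <- function(post_text, comment_texts = character()) {
  has_strong <- grepl(strong_pat, post_text, ignore.case = TRUE)
  has_weak   <- grepl(weak_pat,   post_text, ignore.case = TRUE)
  strong_com <- any(grepl(strong_pat, comment_texts, ignore.case = TRUE))
  has_strong || (has_weak && strong_com)
}

keep_post("il se met souvent en colere")                          # FALSE
keep_post("il se met souvent en colere", "c'est de la violence")  # TRUE
```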

2 New lexicon

We try to take several types of violence into account, so I group the terms into categories. Here is the full list of terms retained:

View the code
# Termes du lexique par catégorie de violences
generale <- c("violemment", "violamment", "violence", "violences", "violente",
              "violentes", "violent", "violents", "violentometre","violenter",
              "victime", "victimes", "tw")

# (duplicate entries for "battre"/"battu(e)(s)" and "blessure(s)" removed)
physique <- c("blessure", "blessures", 
              "hematome", "hematomes", 
              "fracture", "fractures",
              "passage à tabac", "strangulation", "sequestration", "sequestrations", "torture", 
              "gifle", "gifler", "gifles", 
              "bousculer", "bouscule", 
              "battre", "battu", "battus", "battue", "battues", 
              "subir des coups", 
              "frappe", "frapper",
              "poing", "poings", 
              "baffe", "baffes", 
              "claque", "claques", "etrangler", "etrangle", "taper")

verbale <- c("injures", 
             "menaces", "menace", "menacer", 
             "propos degradants", 
             "abus", "abusif", "abusive", "abusifs", "abusives", "abuse", "abuses", "abuser",
             "colere", "coleres",
             "crier", "cri", "cris", 
             "insultant", "insultante", "insulte", "insultes", "insulter", 
             "traiter",  "verbalement",
             "enervement", "enervements")

psy <- c("intimidation", "intimidations", 
         "chantage", "chantages", 
         "culpabilisation", 
         "contrôle", "m isole", "m isolait", 
         "rabaisser", "rabaisse",
         "manipulation", "manipulations", "manipule", "manipuler", 
         "intimider", 
         "gaslighting", "gaslight", "gaslighte", 
         "mettre la pression",
         "humiliation", "humiliations", "humilier", "humilie", "humilies",
         "narcissique")

harcel <- c("stalk", "stalke", "stalking", "stalkait", "stalkais",  
            "surveiller", "surveille", 
            "traquer", "traque", 
            "importuner", "messages repetes", 
            "insistance", 
            "intrusif", "intrusive",
            "harcele", "harcelement", "harcelements", "harceler")

sexuelle <- c("coercition", 
              "consentement", 
              "contraindre",
              "contrainte", "contraint", "sous contrainte", 
              "accoste", "accostes", "accoster", "attraper",
              "agresse", "agresses", "agresser", "agression", 
              "agressions", "agression sexuelle", "agressions sexuelles", 
              "attouchement", "attouchements", "attouche",
              "inceste", "incestes", "incestueux", "incestueuse", "incestueuses", "incestuel", "incestuels", "incestuelle", "incestuelles",
              "pedophilie", "pedophile", "pedophiles",   
              "rapport non consenti", "rapports non consentis", 
              "sexe non consenti",
              "viol", "viole", "viols", "violes", "violer")

domeco <- c("maltraitance", 
            "maltraitances", 
            "dependance", 
            "privation", 
            "conjugal", "conjugale", "conjugales", "conjugaux",
            "emprise"
            )

numerique <- c("cyberharcelement", 
               "doxxing", 
               "doxxe", 
               "revenge porn", 
               "piratage", 
               "usurper l identite")

discrimination <- c("racisme", "raciste", 
                    "sexisme", "sexiste",
                    "misogyne", "misoginie", "misogynie",
                    "homophobie", "homophobe", 
                    "transphobie", "transphobe", 
                    "validisme", "validiste", 
                    "antisemitisme", "antisemite", 
                    "islamophobie", "islamophobe", 
                    "discrimination", "discriminations", "discrimine",
                    "stigmatise", "stigmatiser", 
                    "attaquer", "attaque")

insti_symb <- c("negligence", 
                "injustice", "injustices", 
                "stereotype", "stereotypes",
                "domination", "dominations", 
                "mepris", "meprise", "meprisant", "meprisante", 
                "invisibilisation", "invisibilise", "invisibiliser", 
                "impose", "imposer", "minimiser", "minimise", 
                "ridiculiser", "ridiculise", "marginaliser", "marginalise", 
                "normaliser", "normalise",
                "plainte", "plaintes", "police",
                "temoignage", "temoignages", "temoigner", "temoin", "temoins")

# Create named list for easier processing
violence_categories <- list(
  generale = generale,
  physique = physique,
  verbale = verbale,
  psy = psy,
  harcel = harcel,
  sexuelle = sexuelle,
  domeco = domeco,
  numerique = numerique,
  discrimination = discrimination,
  insti_symb = insti_symb
)

# Simplified function to map terms to groups
map_term_to_group <- function(term, category) {
  term_lower <- tolower(term)
  
  # First, handle multi-word terms (keep them as is)
  multi_word_terms <- c(
    "passage à tabac", "propos degradants", "m isole", "m isolait", 
    "mettre la pression", "messages repetes", "sous contrainte",
    "agression sexuelle", "agressions sexuelles", "rapport non consenti",
    "rapports non consentis", "sexe non consenti", "revenge porn",
    "usurper l identite", "subir des coups"  # matches the lexicon entry (no apostrophe)
  )
  
  if (term %in% multi_word_terms) {
    return(term)  # Keep multi-word terms as is
  }
  
  # Single word terms - simple stem extraction
  case_when(
    # Keep original mapping for terms you already had
    str_detect(term_lower, "^accost") ~ "accost",
    str_detect(term_lower, "\\b(attraper|attrape)\\b") ~ "attrape",
    str_detect(term_lower, "^agress") ~ "agress",
    str_detect(term_lower, "\\b(attouchement|attouchements|attouche)\\b") ~ "attouchement",
    str_detect(term_lower, "\\b(frappe|frapper|poing|poings)\\b") ~ "frappe/poing",
    str_detect(term_lower, "\\b(abusif|abusive|abus|abusifs|abusives|abuse|abuses|abuser)\\b") ~ "abus",
    str_detect(term_lower, "^harcel") ~ "harcel",
    str_detect(term_lower, "^(incest|pedophil)") ~ "inceste/pedophilie",
    str_detect(term_lower, "\\b(insultant|insultante|insulte|insulter|insultes)\\b") ~ "insult",
    str_detect(term_lower, "\\b(viol|viole|violer|viols|violes)\\b") ~ "viol",
    str_detect(term_lower, "\\b(violemment|violamment|violence|violences|violent|violents|violente|violentes|violentometre|violenter)\\b") ~ "violen",

    str_detect(term_lower, "^conjug") ~ "conjugal",
    
    # New groupings based on your terms
     str_detect(term_lower, "^victim") ~ "victime",

    # Physical violence
    str_detect(term_lower, "^blessur") ~ "blessure",
    str_detect(term_lower, "^hematom") ~ "hematome",
    str_detect(term_lower, "^fractur") ~ "fracture",
    str_detect(term_lower, "^strangul") ~ "strangulation",
    str_detect(term_lower, "^sequestr") ~ "sequestration",
    str_detect(term_lower, "^tortur") ~ "torture",
    str_detect(term_lower, "^gifl") ~ "gifle",
    str_detect(term_lower, "^bouscul") ~ "bousculer",
    str_detect(term_lower, "^batt") ~ "battre",
    str_detect(term_lower, "^claqu") ~ "claque",
    str_detect(term_lower, "^etrangl") ~ "etrangler",
    str_detect(term_lower, "^tap") ~ "taper",
    str_detect(term_lower, "^baff") ~ "baffe",
    
    # Verbal violence
    str_detect(term_lower, "^injure") ~ "injure",
    str_detect(term_lower, "^menac") ~ "menace",
    str_detect(term_lower, "^rabais") ~ "rabaisser",
    str_detect(term_lower, "^coler") ~ "colere",
    str_detect(term_lower, "^cri") ~ "crier",
    str_detect(term_lower, "^trait") ~ "traiter",
    
    # Psychological violence (including humiliation)
    str_detect(term_lower, "^intimid") ~ "intimidation",
    str_detect(term_lower, "^chantag") ~ "chantage",
    str_detect(term_lower, "^culpabilis") ~ "culpabilisation",
    str_detect(term_lower, "^contrôl") ~ "contrôle",
    str_detect(term_lower, "^manipul") ~ "manipulation",
    str_detect(term_lower, "^gaslight") ~ "gaslighting",
    str_detect(term_lower, "^humili") ~ "humiliation",
    str_detect(term_lower, "^narciss") ~ "narcissique",
    
    # Harassment
    str_detect(term_lower, "^stalk") ~ "stalking",
    str_detect(term_lower, "^surveill") ~ "surveiller",
    str_detect(term_lower, "^traqu") ~ "traquer",
    str_detect(term_lower, "^importun") ~ "importuner",
    str_detect(term_lower, "^insist") ~ "insistance",
    str_detect(term_lower, "^intrus") ~ "intrusif",
    
    # Sexual violence
    str_detect(term_lower, "^coercit") ~ "coercition",
    str_detect(term_lower, "^consent") ~ "consentement",
    str_detect(term_lower, "^contrain") ~ "contrainte",
    str_detect(term_lower, "^pedophil") ~ "pedophile",
    
    # Domestic/Economic violence
    str_detect(term_lower, "^maltrait") ~ "maltraitance",
    str_detect(term_lower, "^depend") ~ "dependance",
    str_detect(term_lower, "^privat") ~ "privation",
    str_detect(term_lower, "^empris") ~ "emprise",
    str_detect(term_lower, "^enerv") ~ "enervement",
    
    # Digital violence
    str_detect(term_lower, "^cyberharcel") ~ "cyberharcelement",
    str_detect(term_lower, "^doxx") ~ "doxxing",
    str_detect(term_lower, "^pirat") ~ "piratage",
    
    # Discrimination
    str_detect(term_lower, "^racis") ~ "racisme",
    str_detect(term_lower, "^sexis") ~ "sexisme",
    str_detect(term_lower, "^m[iy]sog") ~ "misogynie",
    str_detect(term_lower, "^homophob") ~ "homophobie",
    str_detect(term_lower, "^transphob") ~ "transphobie",
    str_detect(term_lower, "^validis") ~ "validisme",
    str_detect(term_lower, "^antisemi") ~ "antisemitisme",
    str_detect(term_lower, "^islamophob") ~ "islamophobie",
    str_detect(term_lower, "^discrimin") ~ "discrimination",
    str_detect(term_lower, "^stigmat") ~ "stigmatisation",
    str_detect(term_lower, "^attaqu") ~ "attaque",
    
    # Institutional/Symbolic violence
    str_detect(term_lower, "^plaint") ~ "plainte",
    str_detect(term_lower, "^temoi") ~ "temoign",
    str_detect(term_lower, "^police") ~ "police",
    str_detect(term_lower, "^neglig") ~ "negligence",
    str_detect(term_lower, "^injust") ~ "injustice",
    str_detect(term_lower, "^stereotyp") ~ "stereotype",
    str_detect(term_lower, "^domin") ~ "domination",
    str_detect(term_lower, "^mepris") ~ "mepris",
    str_detect(term_lower, "^invisibil") ~ "invisibilisation",
    str_detect(term_lower, "^impos") ~ "imposer",
    str_detect(term_lower, "^minimis") ~ "minimiser",
    str_detect(term_lower, "^ridiculis") ~ "ridiculiser",
    str_detect(term_lower, "^marginalis") ~ "marginaliser",
    str_detect(term_lower, "^normalis") ~ "normaliser",
    
    # Keep all other terms as is (no declension)
    TRUE ~ term
  )
}

# Create dataframe
df_lexique <- data.frame()

for (category_name in names(violence_categories)) {
  category_terms <- violence_categories[[category_name]]
  
  for (term in category_terms) {
    group <- map_term_to_group(term, category_name)
    
    df_lexique <- rbind(df_lexique, data.frame(
      type_violence = category_name,
      terme_original = term,
      racine = group,  # Using group as the root
      stringsAsFactors = FALSE
    ))
  }
}

# Remove duplicates
df_lexique <- df_lexique %>%
  distinct(type_violence, terme_original, .keep_all = TRUE)

# Order by type_violence and terme_original
df_lexique <- df_lexique %>%
  arrange(type_violence, terme_original)

# Create a version with all forms grouped by root
df_lexique_grouped <- df_lexique %>%
  group_by(type_violence, racine) %>%
  summarise(
    formes = paste(sort(unique(terme_original)), collapse = ", "),
    nb_formes = n(),
    .groups = 'drop'
  ) %>%
  arrange(type_violence, racine)

resume <- df_lexique_grouped %>% 
  group_by(type_violence) %>% 
  summarise(all_racine= paste0(racine, collapse = ", "),
         all_termes = paste0(formes, collapse = ", ")) %>% 
  ungroup()

resume_dt <- datatable(resume,
  filter = "top",              # adds per-column filters
  options = list(
    pageLength = 10,           # Show 10 rows per page
    scrollX = TRUE,            # Horizontal scroll if needed
    scrollY = "400px"          # Fixed height with vertical scroll
  ),
  class = 'display compact',   # Compact styling
  rownames = FALSE             # Hide row numbers
)

resume_dt

3 Counts of posts, comments and occurrences

View the code
grouped_terms <- df_lexique %>%
  group_by(racine) %>%
  summarise(terms = list(unique(terme_original)), .groups = 'drop') %>%
  {setNames(.$terms, .$racine)}

# Function to count documents containing ANY term from a group
count_docs_with_term_group <- function(texts, term_group) {
  if (length(texts) == 0) return(0)
  # Create pattern for all terms in the group
  pattern <- paste0("\\b", term_group, "\\b", collapse = "|")
  sum(str_detect(texts, regex(pattern, ignore_case = TRUE)))
}

# Function to count TOTAL OCCURRENCES of terms from a group
count_total_occurrences <- function(texts, term_group) {
  if (length(texts) == 0) return(0)
  # Create a pattern for all terms in the group
  pattern <- paste0("\\b", term_group, "\\b", collapse = "|")
  
  # Count ALL occurrences (not just documents)
  total_count <- sum(str_count(texts, regex(pattern, ignore_case = TRUE)))
  
  return(total_count)
}

# Count for posts in specific dataset (sex_etc)
nb_post_relation <- sapply(grouped_terms, function(terms) {
  count_docs_with_term_group(sex_etc$selftext2, terms)
})

occu_post_relation <- sapply(grouped_terms, function(terms) {
  count_total_occurrences(sex_etc$selftext2, terms)
})

# Count for posts in all base dataset (all_sub2)
nb_post_allbase <- sapply(grouped_terms, function(terms) {
  count_docs_with_term_group(all_sub2$selftext2, terms)
})

nb_occu_post_allbase <- sapply(grouped_terms, function(terms) {
  count_total_occurrences(all_sub2$selftext2, terms)
})

# Count for comments in specific dataset (com_sex_etc)
nb_com <- sapply(grouped_terms, function(terms) {
  count_docs_with_term_group(com_sex_etc$body2, terms)
})

# Also count total occurrences in comments if needed
occu_com <- sapply(grouped_terms, function(terms) {
  count_total_occurrences(com_sex_etc$body2, terms)
})

nb_com_allbase <- sapply(grouped_terms, function(terms) {
  count_docs_with_term_group(com$body2, terms)
})

occu_com_allbase <- sapply(grouped_terms, function(terms) {
  count_total_occurrences(com$body2, terms)
})

# Create final dataframe with statistics
stat_post <- data.frame(
  term_group = names(grouped_terms),
  terms = sapply(grouped_terms, paste, collapse = ", "),
  nb_post_relation = nb_post_relation,
  occu_post_relation = occu_post_relation,
  nb_post_allbase = nb_post_allbase,
  nb_occu_post_allbase = nb_occu_post_allbase,
  nb_com = nb_com,
  occu_com = occu_com,
  nb_com_allbase = nb_com_allbase,
  occu_com_allbase = occu_com_allbase,
  stringsAsFactors = FALSE
) %>%
  mutate(
    # Add violence type from your df_lexique dataframe
    type_violence = sapply(term_group, function(x) {
      types <- unique(df_lexique$type_violence[df_lexique$racine == x])
      if(length(types) > 1) paste(types, collapse = ", ") else types
    })
  ) %>%
  # Select and order columns
  select(type_violence, term_group, terms,
         nb_post_relation, nb_com, occu_post_relation, occu_com,
         nb_post_allbase, nb_occu_post_allbase,
         nb_com_allbase, occu_com_allbase) %>%
  arrange(desc(nb_post_relation))

dt <- datatable(
  stat_post,
  filter = 'top',
  options = list(
    pageLength = 15,
    scrollX = TRUE,
    scrollY = "500px",
    columnDefs = list(
      list(targets = 2, width = '200px'),
      list(targets = c(3:6), className = 'dt-center'),
      # Add select filter for type_violence column
      list(
        targets = which(names(stat_post) == "type_violence") - 1,
        searchable = TRUE,
        filter = list(
          position = 'top',
          type = 'select',
          values = unique(stat_post$type_violence)
        )
      )
    )
  ),
  class = 'display compact stripe hover',
  rownames = FALSE,
  caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: center;',
    'Violence Terms Statistics by Group',
    htmltools::tags$br(),
    htmltools::tags$small('Filter by violence type using the dropdown above each column')
  )
)

dt

We can therefore summarise the statistics for the different violence lexicons taken into account.

View the code
# You can also create a summary by violence type
summary_by_type <- stat_post %>%
  group_by(type_violence) %>%
  summarise(
    nb_groups = n(),
    total_post_relation = sum(nb_post_relation),
    total_nb_com = sum(nb_com),
    total_occu_post_relation = sum(occu_post_relation),
    total_occu_com = sum(occu_com),
    total_post_allbase = sum(nb_post_allbase),
    total_com_allbase = sum(nb_com_allbase),
    total_occu_com_allbase = sum(occu_com_allbase),
    .groups = 'drop'
  ) %>%
  arrange(desc(total_post_relation))

# Create a separate datatable for the summary
dt_summary <- datatable(
  summary_by_type,
  options = list(pageLength = 10, scrollX = TRUE),
  class = 'display compact',
  rownames = FALSE,
  caption = 'Summary of stats by violence type'
)

dt_summary

4 Building the databases

We can therefore build two databases: one with the posts containing the chosen terms, and one with the comments containing them. We search only within posts classified as “Relations affectives et sexuelles” to limit noise. Below are 10 randomly drawn posts, along with the word that caused their inclusion in the database.

View the code
term_etendu_vector <- c(generale, physique, verbale, psy, harcel, sexuelle, domeco, numerique, discrimination, insti_symb)
term_etendu <- paste0("\\b", c(term_etendu_vector) , "\\b", collapse = "|")


vss_post_plus <- all_sub2 %>% 
  filter(typo_label == "Relations affectives et sexuelles") %>% 
  filter(str_detect(selftext2, regex(term_etendu, ignore_case = TRUE))) %>% 
  mutate(
    matches = str_extract_all(selftext2, regex(term_etendu, ignore_case = TRUE)),
    flag = map_chr(matches, ~paste(unique(.x), collapse = ", "))
  ) %>% 
  select(-matches)

nb_post <- nrow(vss_post_plus)

test <- vss_post_plus %>% 
  sample_n(10) %>% 
  select(selftext2, flag) 


dt <- datatable(test,
  options = list(
    pageLength = 5,            # Show 5 rows per page
    scrollX = TRUE,            # Horizontal scroll if needed
    scrollY = "400px"          # Fixed height with vertical scroll
  ),
  class = 'display compact',   # Compact styling
  rownames = FALSE             # Hide row numbers
)

dt

This yields a database of 1066 posts.

We apply the same method to the comments, removing the moderation team’s comments, which often contain the word “insultes”. Below are 10 randomly drawn comments, along with the word that caused their inclusion in the database.

View the code
vss_com_plus <- com %>% 
  filter(author %nin% c("AskMec-ModTeam", "AskMeuf-ModTeam")) %>%  # second name assumed; the original tested "AskMec-ModTeam" twice
  filter(typo_post == "Relations affectives et sexuelles" ) %>% 
  filter(str_detect(body2, regex(term_etendu, ignore_case = TRUE))) %>% 
  mutate(
    matches = str_extract_all(body2, regex(term_etendu, ignore_case = TRUE)),
    flag = map_chr(matches, ~paste(unique(.x), collapse = ", "))
  ) %>% 
  select(-matches)

com_grouped <- vss_com_plus %>% 
  group_by(clé) %>% 
  summarise(count = n()) %>% 
  arrange(desc(count)) 

com_grouped_sup3 <- com_grouped %>% 
  filter(count > 3)  # NB: strictly more than 3, i.e. at least 4 matching comments


test2 <- vss_com_plus %>% 
  sample_n(10) %>% 
  select(body2, flag) 

dt <- datatable(test2,
  options = list(
    pageLength = 5,            # Show 5 rows per page
    scrollX = TRUE,            # Horizontal scroll if needed
    scrollY = "400px"          # Fixed height with vertical scroll
  ),
  class = 'display compact',   # Compact styling
  rownames = FALSE             # Hide row numbers
)

dt

This yields a database of 18783 comments, replying to 2921 posts.

4.1 Extended database

We can extend the number of posts analysed by including the posts that have at least 3 comments drawing on the lexical field of violence, in order to see which posts attract comments containing terms from our lexicon. Some of these posts are “calls for testimonies”, and others may be accounts of violence in which the post’s author does not name the violence as such. Below is a sample of 10 posts that contain no word from the lexicon but have at least three comments that do. The most popular comment containing a lexicon word is shown on the right, together with the word that was recognised and the comment’s score.

View the code
vss_com_plus_top <- vss_com_plus %>%
  group_by(clé) %>%
  arrange(desc(score)) %>%  # Sort by score descending (highest first)
  slice(1) %>%  # Take only the first (highest score) for each post
  ungroup()

post_plusplus <- all_sub2 %>% 
  filter(links %in% com_grouped_sup3$clé) %>% 
  filter(links %nin% vss_post_plus$links) %>% 
  mutate(com_scoremax = vss_com_plus_top$body[match(links, vss_com_plus_top$clé)],
         flag_com = vss_com_plus_top$flag[match(links, vss_com_plus_top$clé)],
         score_com = vss_com_plus_top$score[match(links, vss_com_plus_top$clé)]) 


post_plus_test <-  post_plusplus %>% 
  sample_n(10) %>% 
  select(selftext, com_scoremax,flag_com)

dt <- datatable(post_plus_test,
  options = list(
    pageLength = 5,            # Show 5 rows per page
    scrollX = TRUE,            # Horizontal scroll if needed
    scrollY = "400px"          # Fixed height with vertical scroll
  ),
  class = 'display compact',   # Compact styling
  rownames = FALSE             # Hide row numbers
)

dt

Not counting those already included in the first df, that makes 827 posts we can add to the base.

View the code
post_vss <- all_sub2 %>% 
  filter(links %in% vss_post_plus$links | links %in% post_plusplus$links) %>% 
  mutate(
    prov = ifelse(links %in% vss_post_plus$links & links %nin% post_plusplus$links, "post", "com"),
    flag = case_when(
      prov == 'post' ~ vss_post_plus$flag[match(links, vss_post_plus$links)],
      prov == 'com' ~ "com flag"
    ),
    flag_com = case_when(
      prov == 'com' ~ post_plusplus$flag_com[match(links, post_plusplus$links)],
      prov == 'post' ~ "post_flag"  # sentinel string (not NA) for post rows
    ),
    com_max = ifelse(
      prov == "com", 
      post_plusplus$com_scoremax[match(links, post_plusplus$links)],
      "post_flag"
    )
  )

com_post_vss <- com %>% 
  filter(clé %in% post_vss$links)

Merging the two databases, we therefore get 1893 posts dealing with violence, and if we retrieve all the comments on all these posts (which may therefore not contain any violence-related terms), we obtain 117720 comments.

5 Descriptive statistics

5.1 Posts

We can now start producing descriptive statistics. In the end we have three databases for the posts, and two databases for the comments.

For the posts:

  • Posts containing the lexicon words (strict base, 1066 posts)

  • Posts with at least 3 matching comments (extension of the base, 827 posts)

  • The merger of these two bases (extended base, 1893 posts)
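The reported sizes of the three post bases are mutually consistent; a one-line check:

```r
# Consistency check on the reported base sizes
strict    <- 1066  # posts matching the lexicon directly
extension <- 827   # posts added via their flagged comments
stopifnot(strict + extension == 1893)  # merged, extended base
```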

5.1.1 Description of the strict base

View the code
vss_post_plus %>%
  tbl_cross(row = genre2, col = subreddit, percent = "cell") %>%
  modify_caption("**Cell percentages**")

Cell percentages
genre2     AskMec         AskMeuf        Total
Femme      77 (7.2%)      487 (46%)      564 (53%)
Homme      144 (14%)      268 (25%)      412 (39%)
Unknown    12 (1.1%)      78 (7.3%)      90 (8.4%)
Total      233 (22%)      833 (78%)      1,066 (100%)

View the code
vss_post_plus %>%
  tbl_cross(row = genre2, col = subreddit, percent = "row") %>%
  modify_caption("**Row percentages**")

Row percentages
genre2     AskMec         AskMeuf        Total
Femme      77 (14%)       487 (86%)      564 (100%)
Homme      144 (35%)      268 (65%)      412 (100%)
Unknown    12 (13%)       78 (87%)       90 (100%)
Total      233 (22%)      833 (78%)      1,066 (100%)

View the code
vss_post_plus %>%
  tbl_cross(row = genre2, col = subreddit, percent = "column") %>%
  modify_caption("**Column percentages**")

Column percentages
genre2     AskMec         AskMeuf        Total
Femme      77 (33%)       487 (58%)      564 (53%)
Homme      144 (62%)      268 (32%)      412 (39%)
Unknown    12 (5.2%)      78 (9.4%)      90 (8.4%)
Total      233 (100%)     833 (100%)     1,066 (100%)

5.1.2 Description of the base extension

View the code
post_plusplus %>%
  tbl_cross(row = genre2, col = subreddit, percent = "cell") %>%
  modify_caption("**Cell percentages**")

Cell percentages
genre2     AskMec         AskMeuf        Total
Femme      138 (17%)      256 (31%)      394 (48%)
Homme      192 (23%)      182 (22%)      374 (45%)
Unknown    23 (2.8%)      36 (4.4%)      59 (7.1%)
Total      353 (43%)      474 (57%)      827 (100%)

View the code
post_plusplus %>%
  tbl_cross(row = genre2, col = subreddit, percent = "row") %>%
  modify_caption("**Row percentages**")

Row percentages
genre2     AskMec         AskMeuf        Total
Femme      138 (35%)      256 (65%)      394 (100%)
Homme      192 (51%)      182 (49%)      374 (100%)
Unknown    23 (39%)       36 (61%)       59 (100%)
Total      353 (43%)      474 (57%)      827 (100%)

View the code
post_plusplus %>%
  tbl_cross(row = genre2, col = subreddit, percent = "column") %>%
  modify_caption("**Column percentages**")

Column percentages
genre2     AskMec         AskMeuf        Total
Femme      138 (39%)      256 (54%)      394 (48%)
Homme      192 (54%)      182 (38%)      374 (45%)
Unknown    23 (6.5%)      36 (7.6%)      59 (7.1%)
Total      353 (100%)     474 (100%)     827 (100%)

Men are over-represented in this sample compared with the strict post base (45% vs 39%). This gap can be probed by looking more closely at what they write about. Below are 10 randomly drawn examples of posts written by men that contain no lexicon word but have at least three comments that do.

View the code
post_plus_test <-  post_plusplus %>% 
  filter(genre2 == "Homme") %>% 
  sample_n(10) %>% 
  select(selftext, com_scoremax,flag_com )

dt <- datatable(post_plus_test,
  options = list(
    pageLength = 5,            # Show 5 rows per page
    scrollX = TRUE,            # Horizontal scroll if needed
    scrollY = "400px"          # Fixed height with vertical scroll
  ),
  class = 'display compact',   # Compact styling
  rownames = FALSE             # Hide row numbers
)

dt

5.1.3 Description of the extended base

View the code
post_vss %>%
  tbl_cross(row = genre2, col = subreddit, percent = "cell") %>%
  modify_caption("**Cell percentages**")

Cell percentages
genre2     AskMec         AskMeuf        Total
Femme      215 (11%)      743 (39%)      958 (51%)
Homme      336 (18%)      450 (24%)      786 (42%)
Unknown    35 (1.8%)      114 (6.0%)     149 (7.9%)
Total      586 (31%)      1,307 (69%)    1,893 (100%)

View the code
post_vss %>%
  tbl_cross(row = genre2, col = subreddit, percent = "row") %>%
  modify_caption("**Row percentages**")

Row percentages
genre2     AskMec         AskMeuf        Total
Femme      215 (22%)      743 (78%)      958 (100%)
Homme      336 (43%)      450 (57%)      786 (100%)
Unknown    35 (23%)       114 (77%)      149 (100%)
Total      586 (31%)      1,307 (69%)    1,893 (100%)

View the code
post_vss %>%
  tbl_cross(row = genre2, col = subreddit, percent = "column") %>%
  modify_caption("**Column percentages**")

Column percentages
genre2     AskMec         AskMeuf        Total
Femme      215 (37%)      743 (57%)      958 (51%)
Homme      336 (57%)      450 (34%)      786 (42%)
Unknown    35 (6.0%)      114 (8.7%)     149 (7.9%)
Total      586 (100%)     1,307 (100%)   1,893 (100%)

Adding terms to the lexicon has greatly increased the share of men in the sample: from 35% to 39% for the strict base, and from 38% to 41% for the extended base. On the other hand, we gained quite a few posts (from 923 to 1893 for the extended base), so I suggest we drop the posts written by people whose gender I cannot infer.

We have nearly doubled the number of posts in the extended base: do we accept this noise, or should I adapt the rule so that it takes 4 or 5 matching comments to trigger adding a post to the base?
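A sketch of how that adjustment could be weighed (toy counts standing in for `com_grouped$count`, the per-post number of lexicon-matching comments): raising the threshold k shrinks the extension monotonically, so the trade-off can be read off a small table before committing to a rule.

```r
# Toy per-post counts of lexicon-matching comments
flagged_per_post <- c(1, 2, 3, 4, 4, 5, 7, 10)

# Number of posts the extension would add for each threshold k
sapply(3:5, function(k) sum(flagged_per_post >= k))  # 6 5 3
```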

5.2 Comments

The 2 bases:

  • Comments containing the lexicon words (strict comment base, 18783 comments)

  • Comments replying to a post from the extended base (extended base, 117720 comments)

(Look at the comments that reply only to the extension?)

5.2.1 Description of the strict base

Voir le code
vss_com_plus %>%
  tbl_cross(row = genre2, col = subreddit, percent = "cell") %>%
  modify_caption("**Pourcentages par cellule**")
Pourcentages par cellule
subreddit
Total
AskMec AskMeuf
genre2


    Femme 1,526 (8.1%) 9,312 (50%) 10,838 (58%)
    Homme 4,983 (27%) 2,694 (14%) 7,677 (41%)
    Unknown 18 (<0.1%) 250 (1.3%) 268 (1.4%)
Total 6,527 (35%) 12,256 (65%) 18,783 (100%)
View code
vss_com_plus %>%
  tbl_cross(row = genre2, col = subreddit, percent = "row") %>%
  modify_caption("**Pourcentages en ligne**")

Row percentages (columns: subreddit)
genre2      AskMec          AskMeuf         Total
Femme       1,526 (14%)     9,312 (86%)     10,838 (100%)
Homme       4,983 (65%)     2,694 (35%)     7,677 (100%)
Unknown     18 (6.7%)       250 (93%)       268 (100%)
Total       6,527 (35%)     12,256 (65%)    18,783 (100%)
View code
vss_com_plus %>%
  tbl_cross(row = genre2, col = subreddit, percent = "column") %>%
  modify_caption("**Pourcentages en colonne**")

Column percentages (columns: subreddit)
genre2      AskMec          AskMeuf         Total
Femme       1,526 (23%)     9,312 (76%)     10,838 (58%)
Homme       4,983 (76%)     2,694 (22%)     7,677 (41%)
Unknown     18 (0.3%)       250 (2.0%)      268 (1.4%)
Total       6,527 (100%)    12,256 (100%)   18,783 (100%)

5.2.2 Description of the extended base

View code
com_post_vss %>%
  tbl_cross(row = genre2, col = subreddit, percent = "cell") %>%
  modify_caption("**Pourcentages par cellule**")

Cell percentages (columns: subreddit)
genre2      AskMec           AskMeuf         Total
Femme       11,763 (10.0%)   44,440 (38%)    56,203 (48%)
Homme       42,914 (36%)     17,179 (15%)    60,093 (51%)
Unknown     151 (0.1%)       1,273 (1.1%)    1,424 (1.2%)
Total       54,828 (47%)     62,892 (53%)    117,720 (100%)
View code
com_post_vss %>%
  tbl_cross(row = genre2, col = subreddit, percent = "row") %>%
  modify_caption("**Pourcentages en ligne**")

Row percentages (columns: subreddit)
genre2      AskMec           AskMeuf         Total
Femme       11,763 (21%)     44,440 (79%)    56,203 (100%)
Homme       42,914 (71%)     17,179 (29%)    60,093 (100%)
Unknown     151 (11%)        1,273 (89%)     1,424 (100%)
Total       54,828 (47%)     62,892 (53%)    117,720 (100%)
View code
com_post_vss %>%
  tbl_cross(row = genre2, col = subreddit, percent = "column") %>%
  modify_caption("**Pourcentages en colonne**")

Column percentages (columns: subreddit)
genre2      AskMec           AskMeuf          Total
Femme       11,763 (21%)     44,440 (71%)     56,203 (48%)
Homme       42,914 (78%)     17,179 (27%)     60,093 (51%)
Unknown     151 (0.3%)       1,273 (2.0%)     1,424 (1.2%)
Total       54,828 (100%)    62,892 (100%)    117,720 (100%)

We can also compare the gender distribution of the authors of posts in the extended base by post provenance: was the lexicon used in the post itself, or was the post picked up through its comments?

View code
post_vss %>% 
  tbl_cross(row = genre2, col = prov, percent = "cell") %>% 
  modify_caption("**Pourcentages en cellule**")

Cell percentages (columns: prov)
genre2      com           post          Total
Femme       394 (21%)     564 (30%)     958 (51%)
Homme       374 (20%)     412 (22%)     786 (42%)
Unknown     59 (3.1%)     90 (4.8%)     149 (7.9%)
Total       827 (44%)     1,066 (56%)   1,893 (100%)

6 Gender distribution by violence lexicon

Posts and comments whose author's gender cannot be inferred are filtered out.

View code
com_select <- vss_com_plus %>% 
  select(body2, genre2, flag, body, docid) %>% 
  mutate(text= body2,
         text_raw = body, 
         type = "com") %>% 
  select(-body, -body2)

post_select <- vss_post_plus %>% 
  select(selftext2, genre2, flag, selftext, docid) %>% 
  mutate(text= selftext2,
  text_raw = selftext, 
  type = "post") %>% 
  select(-selftext, -selftext2)

# Helper: 1 if any flagged term of the text belongs to the given lexicon, 0 otherwise
has_lex <- function(flag, lexicon) {
  as.integer(map_lgl(str_split(flag, ",\\s*"), ~ any(.x %in% lexicon)))
}

all_text <- rbind(com_select, post_select) %>% 
  filter(!is.na(genre2)) %>% 
  mutate(
    lex_generale       = has_lex(flag, generale),
    lex_physique       = has_lex(flag, physique),
    lex_verbale        = has_lex(flag, verbale),
    lex_psy            = has_lex(flag, psy),
    lex_harcel         = has_lex(flag, harcel),
    lex_sexuelle       = has_lex(flag, sexuelle),
    lex_domeco         = has_lex(flag, domeco),
    lex_discrimination = has_lex(flag, discrimination),
    lex_insti_symb     = has_lex(flag, insti_symb)
  )


# First, let's prepare the data in a tidy format
all_text_long <- all_text %>%
  select(docid, genre2, type, starts_with("lex_")) %>%
  pivot_longer(
    cols = starts_with("lex_"),
    names_to = "category",
    values_to = "present"
  ) %>%
  filter(present == 1) %>%  # Keep only rows where the category is present
  mutate(
    category = str_remove(category, "lex_") %>%  # Remove "lex_" prefix
      str_replace_all("_", " ") %>%              # Replace underscores with spaces
      str_to_title()                             # Capitalize first letters
  )

com_long <- all_text_long %>% 
  filter(type == "com")

post_long <- all_text_long %>% 
  filter(type == "post")

# Create a summary table
summary_com <- com_long %>%
  group_by(category, genre2) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(desc(count))

summary_post <- post_long %>%
  group_by(category, genre2) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(desc(count))

# Bar plot with counts by category and gender (type)
p1 <- ggplot(summary_com, aes(x = reorder(category, -count), y = count, fill = genre2)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = count), 
            position = position_dodge(width = 0.9),
            vjust = -0.5, size = 3) +
  labs(
    title = "Com - Effectif par type de lexique et genre",
    x = "Catégorie lexique",
    y = "Effectif",
    fill = "Genre"
  ) +
    scale_fill_manual(values = c("Femme" = "#66023C", "Homme" = "#318CE7")) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5)
  )

p2 <- ggplot(summary_post, aes(x = reorder(category, -count), y = count, fill = genre2)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = count), 
            position = position_dodge(width = 0.9),
            vjust = -0.5, size = 3) +
    scale_fill_manual(values = c("Femme" = "#66023C", "Homme" = "#318CE7")) +
  labs(
    title = "Post - Effectif par type de lexique et genre",
    x = "Catégorie lexique",
    y = "Effectif",
    fill = "Genre"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5)
  )
View code
p2

Posts - Violence lexicon by gender
View code
p1

Comments - Violence lexicon by gender

7 Posts - Gender distribution of lexicon terms

Adding the violence lexicons to the base makes the visualization harder to read. To make it easier, I split by type of violence. The drawback is that this does not allow comparisons across lexicons.

View code
# Create term-to-category mapping
term_to_category <- c(
  setNames(rep("generale", length(generale)), generale),
  setNames(rep("physique", length(physique)), physique),
  setNames(rep("verbale", length(verbale)), verbale),
  setNames(rep("psy", length(psy)), psy),
  setNames(rep("harcel", length(harcel)), harcel),
  setNames(rep("sexuelle", length(sexuelle)), sexuelle),
  setNames(rep("domeco", length(domeco)), domeco),
  setNames(rep("numerique", length(numerique)), numerique),
  setNames(rep("discrimination", length(discrimination)), discrimination),
  setNames(rep("insti_symb", length(insti_symb)), insti_symb)
)
create_category_plot <- function(data, categories, plot_title, min_occurrence = 2) {
  # Filter terms from specified categories
  category_terms <- names(term_to_category)[term_to_category %in% categories]
  
  # Filter and process data
  data_filtered <- data %>%
    mutate(flag_terms = str_split(flag, ",\\s*")) %>%
    filter(map_lgl(flag_terms, ~ any(.x %in% category_terms))) %>%
    mutate(
      flag_group2 = map_chr(
        flag_terms,
        ~ {
          filtered_terms <- .x[.x %in% category_terms]
          if (length(filtered_terms) == 0) return("")
          mapped <- map_chr(filtered_terms, map_term_to_group)
          paste(unique(mapped), collapse = ", ")
        }
      )
    ) %>%
    filter(flag_group2 != "")
  
  if (nrow(data_filtered) == 0) {
    return(
      ggplot() +
        annotate("text", x = 0.5, y = 0.5, 
                 label = paste("Aucune donnée pour", plot_title),
                 size = 6, color = "gray") +
        theme_void()
    )
  }
  
  # Create wide format
  vss_wide_filtered <- data_filtered %>%
    separate_rows(flag_group2, sep = ",\\s*") %>%
    filter(flag_group2 != "") %>%
    mutate(value = 1) %>%
    distinct(docid, flag_group2, .keep_all = TRUE) %>%
    pivot_wider(
      names_from = flag_group2,
      values_from = value,
      values_fill = 0,
      names_prefix = "flag_"
    )
  
  # Prepare for plotting
  data_for_plot <- vss_wide_filtered %>%
    pivot_longer(
      cols = where(is.numeric) & starts_with("flag_"),
      names_to = "flag_words",
      values_to = "present"
    ) %>%
    mutate(flag_words = str_remove(flag_words, "^flag_")) %>%
    filter(present == 1) %>%
    count(flag_words, genre2, name = "effectif") %>%
    
    # FILTER HERE: Remove terms with less than min_occurrence total across genders
    group_by(flag_words) %>%
    mutate(total_effectif = sum(effectif)) %>%
    ungroup() %>%
    filter(total_effectif >= min_occurrence) %>%  # Keep only terms with at least min_occurrence
    
    group_by(genre2) %>%
    mutate(
      total_in_genre2 = sum(effectif),
      frequence = effectif / total_in_genre2 * 100,
      freq = ifelse(genre2 == "Femme", -frequence, frequence)
    ) %>%
    ungroup()
  
  if (nrow(data_for_plot) == 0) {
    return(
      ggplot() +
        annotate("text", x = 0.5, y = 0.5, 
                 label = paste("Aucun terme avec au moins ", min_occurrence, 
                              " occurrences pour", plot_title),
                 size = 6, color = "gray") +
        theme_void()
    )
  }
  
  # Sort by total frequency (after filtering)
  flag_order <- data_for_plot %>%
    group_by(flag_words) %>%
    summarise(total = sum(effectif)) %>%
    arrange(desc(total)) %>%
    pull(flag_words)
  
  data_for_plot <- data_for_plot %>%
    mutate(flag_words = factor(flag_words, levels = flag_order))
  
  # Calculate statistics for subtitle
  total_posts <- nrow(data_filtered)
  total_terms <- n_distinct(data_for_plot$flag_words)
  removed_terms <- n_distinct(vss_wide_filtered %>% 
                                pivot_longer(cols = where(is.numeric) & starts_with("flag_"),
                                             names_to = "flag_words",
                                             values_to = "present") %>%
                                mutate(flag_words = str_remove(flag_words, "^flag_")) %>%
                                filter(present == 1) %>%
                                count(flag_words) %>%
                                filter(n < min_occurrence) %>%
                                pull(flag_words))
  
  # Create plot
  ggplot(data_for_plot, aes(x = freq, y = flag_words, fill = genre2)) +
    geom_col(width = 0.8) +
    geom_text(
      aes(label = paste0(round(abs(freq), 1), "%")),
      position = position_stack(vjust = 0.5),
      size = 3,
      color = "white"
    ) +
    scale_x_continuous(labels = function(x) paste0(abs(x), "%")) +
    scale_fill_manual(values = c("Femme" = "#66023C", "Homme" = "#318CE7")) +
    labs(
      x = "Pourcentage",
      y = "Termes",
      title = plot_title,
      subtitle = paste(
        "Posts analysés: ", total_posts, 
        " | Termes affichés: ", total_terms,
        " (", removed_terms, " termes avec <", min_occurrence, " occurrence(s) masqués)"
      )
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 16, face = "bold"),
      plot.subtitle = element_text(size = 11, color = "gray50", lineheight = 1.2),
      legend.position = "bottom",
      legend.title = element_blank()
    )
}

strict_filt <- vss_post_plus %>% 
  filter(!is.na(genre2))

These analyses are run on the strict post base, with unknown-gender authors filtered out and terms used fewer than 2 times hidden; the base therefore contains 976 posts.

View code
create_category_plot(
  strict_filt,
  categories = c("generale", "physique"),
  plot_title = "Violences Générales & Physiques"
)

Gender distribution - General & physical violence
View code
create_category_plot(
  strict_filt,
  categories = c("psy", "verbale"),
  plot_title = "Violences Psychologiques & Verbales"
)

Gender distribution - Psychological & verbal violence
View code
create_category_plot(
  strict_filt,
  categories = c("domeco", "insti_symb"),
  plot_title = "Violences Conjugales/Économiques & Institutionnelles/Symboliques"
)

Gender distribution - Domestic/economic & institutional/symbolic violence
View code
create_category_plot(
  strict_filt,
  categories = c("sexuelle"),
  plot_title = "Violences Sexuelles"
)

Gender distribution - Sexual violence
View code
create_category_plot(
  strict_filt,
  categories = c("harcel", "numerique", "discrimination"),
  plot_title = "Harcèlement-Discrimination"
)

Gender distribution - Harassment and discrimination

8 Comments - Gender distribution of the terms used

View code
com_strict_filt <- vss_com_plus %>% 
  filter(!is.na(genre2))

These analyses are run on the strict comment base, with unknown-gender authors filtered out and terms used fewer than 2 times hidden; the base therefore contains 18,515 comments.

View code
create_category_plot(
  com_strict_filt,
  categories = c("generale", "physique"),
  plot_title = "Violences Générales & Physiques"
)

Gender distribution - General & physical violence
View code
create_category_plot(
  com_strict_filt,
  categories = c("psy", "verbale"),
  plot_title = "Violences Psychologiques & Verbales"
)

Gender distribution - Psychological & verbal violence
View code
create_category_plot(
  com_strict_filt,
  categories = c("domeco", "insti_symb"),
  plot_title = "Violences Conjugales/Économiques & Institutionnelles/Symboliques"
)

Gender distribution - Domestic/economic & institutional/symbolic violence
View code
create_category_plot(
  com_strict_filt,
  categories = c("sexuelle"),
  plot_title = "Violences Sexuelles"
)

Gender distribution - Sexual violence
View code
create_category_plot(
  com_strict_filt,
  categories = c("harcel", "numerique", "discrimination"),
  plot_title = "Harcèlement-Discrimination"
)

Gender distribution - Harassment and discrimination

To allow a comparison, we can do the same with the 20 terms that appear most often in the selected posts:

8.1 Gender distribution of the top-20 terms by number of posts and comments

View code
create_top20_pyramid_plot <- function(vss_post_plus_data, stat_post_data, 
                                      plot_title = "Top 20 des termes les plus fréquents") {
  
  # Get top 20 term groups from stat_post
  top_20_groups <- stat_post_data %>%
    arrange(desc(nb_post_relation)) %>%
    slice_head(n = 20) %>%
    pull(term_group)
  
  # Process the data to get gender distribution for these top 20 terms
  # Using the same logic as your original plot function
  
  # First, filter vss_post_plus to only include posts with flag terms
  data_processed <- vss_post_plus_data %>%
    filter(flag != "")  # Only posts with flags
  
  if (nrow(data_processed) == 0) {
    return(
      ggplot() +
        annotate("text", x = 0.5, y = 0.5, 
                 label = "Aucune donnée disponible",
                 size = 6, color = "gray") +
        theme_void()
    )
  }
  
  # Create wide format (same as your original code)
  vss_wide <- data_processed %>%
    separate_rows(flag, sep = ",\\s*") %>%
    filter(flag != "") %>%
    
    # Map individual terms to groups using your map_term_to_group function
    mutate(flag_group = map_chr(flag, map_term_to_group)) %>%
    
    # Filter to keep only top 20 groups
    filter(flag_group %in% top_20_groups) %>%
    
    mutate(value = 1) %>%
    distinct(docid, flag_group, .keep_all = TRUE) %>%
    pivot_wider(
      names_from = flag_group,
      values_from = value,
      values_fill = 0,
      names_prefix = "flag_"
    )
  
  # Prepare data for plotting (same as your original code)
  data_for_plot <- vss_wide %>%
    pivot_longer(
      cols = where(is.numeric) & starts_with("flag_"),
      names_to = "flag_words",
      values_to = "present"
    ) %>%
    mutate(flag_words = str_remove(flag_words, "^flag_")) %>%
    filter(present == 1) %>%
    count(flag_words, genre2, name = "effectif") %>%
    
    # Remove NA genders if any
    filter(!is.na(genre2)) %>%
    
    group_by(genre2) %>%
    mutate(
      total_in_genre2 = sum(effectif),
      frequence = effectif / total_in_genre2 * 100,
      freq = ifelse(genre2 == "Femme", -frequence, frequence)
    ) %>%
    ungroup()
  
  if (nrow(data_for_plot) == 0) {
    return(
      ggplot() +
        annotate("text", x = 0.5, y = 0.5, 
                 label = "Aucun terme trouvé parmi le top 20",
                 size = 6, color = "gray") +
        theme_void()
    )
  }
  
  # Sort terms by overall frequency (from stat_post order)
  # Get the order from stat_post to maintain consistency
  group_order <- stat_post_data %>%
    filter(term_group %in% top_20_groups) %>%
    arrange(desc(nb_post_relation)) %>%
    pull(term_group)
  
  # Convert to factor with the correct order
  data_for_plot <- data_for_plot %>%
    mutate(flag_words = factor(flag_words, levels = rev(group_order)))  # rev() for descending order in plot
  
  # Calculate statistics
  total_posts <- nrow(data_processed)
  total_posts_with_top20 <- nrow(vss_wide)
  total_mentions <- sum(data_for_plot$effectif)
  
  # Calculate coverage: what percentage of all flagged posts contain at least one top 20 term
  all_flagged_posts <- vss_post_plus_data %>%
    filter(flag != "") %>%
    nrow()
  
  coverage <- round((total_posts_with_top20 / all_flagged_posts) * 100, 1)
  
  # Create pyramid plot (identical to your original style)
  ggplot(data_for_plot, aes(x = freq, y = flag_words, fill = genre2)) +
    geom_col(width = 0.8) +
    geom_text(
      aes(label = paste0(round(abs(freq), 1), "%")),
      position = position_stack(vjust = 0.5),
      size = 3,
      color = "white"
    ) +
    scale_x_continuous(labels = function(x) paste0(abs(x), "%")) +
    scale_fill_manual(values = c("Femme" = "#66023C", "Homme" = "#318CE7")) +
    labs(
      x = "Pourcentage",
      y = "Termes",
      title = plot_title,
      subtitle = paste(
        "Nombre posts: ", total_posts
        
      )
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 16, face = "bold"),
      plot.subtitle = element_text(size = 11, color = "gray50", lineheight = 1.2),
      legend.position = "bottom",
      legend.title = element_blank(),
      axis.text.y = element_text(size = 10)
    )
}
View code
# Usage
top20_post <- create_top20_pyramid_plot(
  vss_post_plus_data = strict_filt,
  stat_post_data = stat_post,
  plot_title = " Post Top 20 des termes de violence"
)

# Display the plot
top20_post

Posts - Top 20
View code
create_top20_pyramid_plot <- function(vss_post_plus_data, stat_post_data, 
                                      plot_title = "Top 20 des termes les plus fréquents") {
  
  # Create a wrapper function that only needs the term
  map_term_simple <- function(term) {
    # Use a default category or try to determine it
    map_term_to_group(term, category = "generale")  # Default category
  }
  
  # Get top 20 term groups from stat_post
  top_20_groups <- stat_post_data %>%
    arrange(desc(nb_com)) %>%
    slice_head(n = 20) %>%
    pull(term_group)
  
  # Process the data to get gender distribution for these top 20 terms
  
  # First, filter vss_post_plus to only include posts with flag terms
  data_processed <- vss_post_plus_data %>%
    filter(flag != "")  # Only posts with flags
  
  if (nrow(data_processed) == 0) {
    return(
      ggplot() +
        annotate("text", x = 0.5, y = 0.5, 
                 label = "Aucune donnée disponible",
                 size = 6, color = "gray") +
        theme_void()
    )
  }
  
  # Create wide format
  vss_wide <- data_processed %>%
    separate_rows(flag, sep = ",\\s*") %>%
    filter(flag != "") %>%
    
    # Use the wrapper function instead
    mutate(flag_group = map_chr(flag, map_term_simple)) %>%
    
    # Filter to keep only top 20 groups
    filter(flag_group %in% top_20_groups) %>%
    
    mutate(value = 1) %>%
    distinct(docid, flag_group, .keep_all = TRUE) %>%
    pivot_wider(
      names_from = flag_group,
      values_from = value,
      values_fill = 0,
      names_prefix = "flag_"
    )
  
  # Prepare data for plotting
  data_for_plot <- vss_wide %>%
    pivot_longer(
      cols = where(is.numeric) & starts_with("flag_"),
      names_to = "flag_words",
      values_to = "present"
    ) %>%
    mutate(flag_words = str_remove(flag_words, "^flag_")) %>%
    filter(present == 1) %>%
    count(flag_words, genre2, name = "effectif") %>%
    
    # Remove NA genders if any
    filter(!is.na(genre2)) %>%
    
    group_by(genre2) %>%
    mutate(
      total_in_genre2 = sum(effectif),
      frequence = effectif / total_in_genre2 * 100,
      freq = ifelse(genre2 == "Femme", -frequence, frequence)
    ) %>%
    ungroup()
  
  if (nrow(data_for_plot) == 0) {
    return(
      ggplot() +
        annotate("text", x = 0.5, y = 0.5, 
                 label = "Aucun terme trouvé parmi le top 20",
                 size = 6, color = "gray") +
        theme_void()
    )
  }
  
  # Sort terms by overall frequency (from stat_post order)
  # Use nb_com to match the selection criteria
  group_order <- stat_post_data %>%
    filter(term_group %in% top_20_groups) %>%
    arrange(desc(nb_com)) %>%  # Use nb_com to be consistent
    pull(term_group)
  
  # Convert to factor with the correct order
  data_for_plot <- data_for_plot %>%
    mutate(flag_words = factor(flag_words, levels = rev(group_order)))
  
  # Calculate statistics
  total_posts <- nrow(data_processed)
  total_posts_with_top20 <- nrow(vss_wide)
  total_mentions <- sum(data_for_plot$effectif)
  
  # Calculate coverage: what percentage of all flagged posts contain at least one top 20 term
  all_flagged_posts <- vss_post_plus_data %>%
    filter(flag != "") %>%
    nrow()
  
  coverage <- ifelse(all_flagged_posts > 0, 
                     round((total_posts_with_top20 / all_flagged_posts) * 100, 1),
                     0)
  
  # Create pyramid plot
  ggplot(data_for_plot, aes(x = freq, y = flag_words, fill = genre2)) +
    geom_col(width = 0.8) +
    geom_text(
      aes(label = paste0(round(abs(freq), 1), "%")),
      position = position_stack(vjust = 0.5),
      size = 3,
      color = "white"
    ) +
    scale_x_continuous(labels = function(x) paste0(abs(x), "%")) +
    scale_fill_manual(values = c("Femme" = "#66023C", "Homme" = "#318CE7")) +
    labs(
      x = "Pourcentage",
      y = "Termes",
      title = plot_title,
      subtitle = paste(
        "Nombre com: ", total_posts
      )
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 16, face = "bold"),
      plot.subtitle = element_text(size = 11, color = "gray50", lineheight = 1.2),
      legend.position = "bottom",
      legend.title = element_blank(),
      axis.text.y = element_text(size = 10)
    )
}

# Usage
top20_com <- create_top20_pyramid_plot(
  vss_post_plus_data = com_strict_filt,
  stat_post_data = stat_post,
  plot_title = "Com Top 20 des termes de violences"
)

# Display the plot
top20_com

Comments - Top 20

We can also visualize the gaps in another way, by looking at the share of each term's use by gender. Among all posts containing a given lexicon term, what share were posted by a man and what share were written by a woman? (Redo for the comments / merging comments and posts? + examining presence in post vs. comments?)

View code
top_20_groups <- stat_post %>%
  arrange(desc(nb_post_relation)) %>%
  slice_head(n = 20) %>%
  pull(term_group)

# Step 2: Create term_order (order of terms by frequency)
term_order <- stat_post %>%
  filter(term_group %in% top_20_groups) %>%
  arrange(desc(nb_post_relation)) %>%
  pull(term_group)

# Step 3: Calculate overall gender distribution from the FULL dataset
overall_dist <- strict_filt %>%
  summarise(
    femme_pct = sum(genre2 == "Femme", na.rm = TRUE) / n() * 100,
    homme_pct = sum(genre2 == "Homme", na.rm = TRUE) / n() * 100,
    na_pct = sum(is.na(genre2)) / n() * 100,
    total_posts = n()
  )


simple_data_plot <- strict_filt %>%
  separate_rows(flag, sep = ",\\s*") %>%
  filter(flag != "") %>%
  mutate(flag_group2 = map_chr(flag, map_term_to_group)) %>%
  filter(flag_group2 %in% top_20_groups) %>%
  distinct(docid, flag_group2, .keep_all = TRUE) %>%  # Remove duplicate term groups per doc
  count(flag_group2, genre2, name = "effectif") %>%
  mutate(genre2 = if_else(is.na(genre2), "Non spécifié", genre2)) %>%
  group_by(flag_group2) %>%
  mutate(
    total_term = sum(effectif),
    pct = effectif / total_term * 100
  ) %>%
  ungroup() %>%
  mutate(
    flag_group2 = factor(flag_group2, levels = rev(term_order))
  )

# Step 5: Define colors
genre_colors <- c(
  "Femme" = "#66023C",       
  "Homme" = "#318CE7",
  "Non spécifié" = "gray70"
)

# Step 6: Create the plot
top20_dist_plot <- ggplot(simple_data_plot, aes(x = pct/100, y = flag_group2, fill = genre2)) +
  geom_col(position = "fill", width = 0.7) +
  
  # Vertical lines for overall averages
  geom_vline(xintercept = overall_dist$homme_pct/100, 
             color = "turquoise",
             linetype = "dashed", 
             linewidth = 1,
             alpha = 0.7)+
  # Percentage labels on bars
  geom_text(
    aes(label = paste0(round(pct, 0), "%")),
    position = position_fill(vjust = 0.5),
    color = "white",
    fontface = "bold",
    size = 3.2
  ) + 
  
  # Scales
  scale_x_continuous(
    labels = scales::percent_format(),
    expand = expansion(mult = c(0, 0.1))
  ) +
  
  scale_fill_manual(
    values = genre_colors,
    name = "Genre"
  ) +
  
  # Labels
  labs(
    x = "Proportion par genre",
    y = "Termes (top 20)",
    title = "Top 20 des termes - Distribution par genre",
    subtitle = paste0(
      "Distribution globale (tous posts): ", 
      round(overall_dist$femme_pct, 1), "% Femme, ",
      round(overall_dist$homme_pct, 1), "% Homme\n",
      "Total posts analysés: ", nrow(strict_filt)
    )
  ) +
  
  # Theme
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, color = "gray50", lineheight = 1.2, hjust = 0.5),
    legend.position = "bottom",
    legend.title = element_text(face = "bold"),
    panel.grid.major.y = element_blank(),
    axis.text.y = element_text(size = 10),
    axis.title = element_text(size = 12, face = "bold")
  )

# Step 7: Display the plot
top20_dist_plot

8.2 Slope graph

We can also examine how the most frequent terms differ between women and men, and between posts and comments. This rank comparison is computed on all texts containing a lexicon word, i.e. the merger of the strict post and comment bases. I computed which terms appear most often across all collected texts, then built a frequency ranking of these terms by gender and by text provenance (post/comment). The graphs below show the terms shared between the top-20 lists of most frequent terms, and their rank differences according to the author's gender and the text's provenance.

View code
create_term_groups_wide <- function(data) {
  data %>%
    # Split flags into individual terms
    mutate(term = str_split(flag, ",\\s*")) %>%
    unnest(term) %>%
    # Clean terms
    mutate(term = str_trim(term)) %>%
    filter(term != "") %>%
    # Map terms to groups
    mutate(term_group = map2_chr(term, NA, map_term_to_group)) %>%
    # Remove duplicates within each document (same term group multiple times)
    distinct(docid, genre2, type, term_group) %>%
    # Create dichotomous variables
    mutate(present = 1) %>%
    pivot_wider(
      id_cols = c(docid, genre2, type),
      names_from = term_group,
      names_prefix = "term_",
      values_from = present,
      values_fill = 0
    )
}

# Apply the function
all_text_wide <- create_term_groups_wide(all_text)


top_terms_genre <- all_text_wide %>% 
  group_by(genre2) %>% 
  select(genre2, starts_with("term_")) %>%  # Include genre2 in select
  summarise(across(starts_with("term_"), sum)) %>%  # Summarize only term columns
  pivot_longer(
    cols = -genre2,  # Exclude genre2 from pivoting
    names_to = "term_group",
    values_to = "total_count"
  ) %>%
  mutate(term_group = str_remove(term_group, "term_")) %>%
  group_by(genre2) %>%  # Group again for top 20 per genre
  arrange(genre2, desc(total_count)) %>%
  mutate(rank = row_number()) %>%
  filter(rank <= 20) %>%
  ungroup() %>%
  select(-rank)


comparison_table <- top_terms_genre %>%
  filter(genre2 %in% c("Femme", "Homme")) %>%
  group_by(genre2) %>%
  mutate(rank = rank(-total_count, ties.method = "first")) %>%
  ungroup() %>%
  select(genre2, term_group, total_count, rank) %>%
  pivot_wider(
    id_cols = term_group,
    names_from = genre2,
    values_from = c(total_count, rank),
    names_sep = "_"
  ) %>%
  # Keep only terms in both lists
  filter(!is.na(total_count_Femme) & !is.na(total_count_Homme)) %>%
  mutate(
    rank_diff = abs(rank_Femme - rank_Homme),
    count_ratio = total_count_Femme / total_count_Homme
  ) %>%
  arrange(rank_Femme)

slope_data <- comparison_table %>%
  # Keep only common terms between both genders
  filter(!is.na(rank_Femme) & !is.na(rank_Homme)) %>%
  # Reshape for slope graph
  select(term_group, rank_Femme, rank_Homme) %>%
  pivot_longer(
    cols = c(rank_Femme, rank_Homme),
    names_to = "gender",
    values_to = "rank"
  ) %>%
  mutate(
    gender = ifelse(gender == "rank_Femme", "Femme", "Homme"),
    gender = factor(gender, levels = c("Femme", "Homme")),
    # Convert rank to reverse so 1 is at top (optional)
    rank_rev = 21 - rank,  # Makes rank 1 appear at top of graph
    label = ifelse(gender == "Femme", as.character(term_group), "")
  )

top_meuf <- top_terms_genre %>% 
  filter(genre2 == "Femme")

top_mec <- top_terms_genre %>% 
  filter(genre2 == "Homme")

commun <- sum(top_meuf$term_group %in% top_mec$term_group)
View code
slope_data_colored <- slope_data %>%
  left_join(comparison_table %>% select(term_group, rank_diff), 
            by = "term_group") %>%
  mutate(
    diff_category = case_when(
      rank_diff == 0 | rank_diff == 1 ~ "0-1",
      rank_diff == 2 ~ "2",
      rank_diff == 3 ~ "3",
      rank_diff >= 4 ~ "4-5"
    ))
p3 <- ggplot(slope_data_colored, 
             aes(x = gender, y = rank, group = term_group)) +
  # Lines with color aesthetic
  geom_line(aes(color = diff_category), size = 1, alpha = 0.8) +
  # Points with color aesthetic
  geom_point(aes(color = diff_category), size = 3) +
  # Labels on Femme side (left) - BLACK TEXT
  geom_text_repel(data = slope_data_colored %>% 
                    filter(gender == "Femme"),
                  aes(label = term_group),
                  nudge_x = -0.3,
                  direction = "y",
                  hjust = 1,
                  segment.color = NA,
                  size = 3.5,
                  color = "black",  # Explicitly set text color to black
                  show.legend = FALSE,
                  max.overlaps = 20) +
  # Labels on Homme side (right) - BLACK TEXT
  geom_text_repel(data = slope_data_colored %>% 
                    filter(gender == "Homme"),
                  aes(label = term_group),
                  nudge_x = 0.3,
                  direction = "y",
                  hjust = 0,
                  segment.color = NA,
                  size = 3.5,
                  color = "black",  # Explicitly set text color to black
                  show.legend = FALSE,
                  max.overlaps = 20) +
  scale_color_manual(
    values = c("4-5" = "#d73027", 
               "3" = "#fdae61", 
               "2" = "#4575b4", 
               "0-1" = "grey"),
    name = "Différence de rang"
  ) +
  labs(
    title = "Différences de rang dans les mots les plus utilisés",
    x = "",
    y = "Rang (1 = plus fréquent)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5, size = 14),
    axis.text = element_text(size = 12),
    axis.text.x = element_text(size = 13, face = "bold"),
    legend.position = "bottom",
    legend.text = element_text(size = 11),
    legend.title = element_text(size = 12, face = "bold"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank()
  ) +
  scale_y_reverse()

p3

Number of terms in common: 19/20.
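The odd term out behind the 19/20 figure can be identified with a quick set comparison. A minimal sketch on toy vectors; in the document itself, `top_meuf$term_group` and `top_mec$term_group` would take the place of `top_a` and `top_b`:

```r
# Toy stand-ins for top_meuf$term_group and top_mec$term_group
top_a <- c("peur", "mal", "colère", "violence")
top_b <- c("peur", "mal", "colère", "honte")

n_common <- length(intersect(top_a, top_b))  # number of shared terms
only_a   <- setdiff(top_a, top_b)            # terms only in the first top list
only_b   <- setdiff(top_b, top_a)            # terms only in the second top list
```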

top_terms_prov <- all_text_wide %>% 
  group_by(type) %>% 
  select(type, starts_with("term_")) %>%  # keep type and the term columns
  summarise(across(starts_with("term_"), sum)) %>%  # summarise only term columns
  pivot_longer(
    cols = -type,  # pivot every term column; type stays as the id
    names_to = "term_group",
    values_to = "total_count"
  ) %>%
  mutate(term_group = str_remove(term_group, "term_")) %>%
  group_by(type) %>%  # Group again for top 20 per type (post vs comment)
  arrange(type, desc(total_count)) %>%
  mutate(rank = row_number()) %>%
  filter(rank <= 20) %>%
  ungroup() %>%
  select(-rank)

prov_post <- top_terms_prov %>% 
  filter(type == "post")
prov_com <- top_terms_prov %>% 
  filter(type == "com")




comparison_table <- top_terms_prov %>%
  group_by(type) %>%
  mutate(rank = rank(-total_count, ties.method = "first")) %>%
  ungroup() %>%
  select(type, term_group, total_count, rank) %>%
  pivot_wider(
    id_cols = term_group,
    names_from = type,
    values_from = c(total_count, rank),
    names_sep = "_"
  ) %>%
  # Keep only terms in both lists
  filter(!is.na(total_count_com) & !is.na(total_count_post)) %>%
  mutate(
    rank_diff = abs(rank_com - rank_post),
    count_ratio = total_count_com / total_count_post
  ) %>%
  arrange(rank_post)


slope_data <- comparison_table %>%
  # Keep only terms present in both posts and comments
  filter(!is.na(rank_post) & !is.na(rank_com)) %>%
  # Reshape for slope graph
  select(term_group, rank_post, rank_com) %>%
  pivot_longer(
    cols = c(rank_post, rank_com),
    names_to = "source",
    values_to = "rank"
  ) %>%
  mutate(
    type = ifelse(source == "rank_post", "post", "com"),
    type = factor(type, levels = c("post", "com")),
    # Convert rank to reverse so 1 is at top (optional)
    rank_rev = 21 - rank,  # Makes rank 1 appear at top of graph
    label = ifelse(type == "post", as.character(term_group), "")
  )


slope_data_colored <- slope_data %>%
  left_join(comparison_table %>% select(term_group, rank_diff), 
            by = "term_group") %>%
  mutate(
    diff_category = case_when(
      rank_diff <= 1 ~ "0-1",
      rank_diff %in% 2:3 ~ "2-3",
      rank_diff == 4 ~ "3-4",  # only a difference of 4 lands here
      rank_diff > 4 ~ ">4"
    ),
    diff_category = fct_relevel(diff_category,
                                "0-1", "2-3", "3-4", ">4"))


p4 <- ggplot(slope_data_colored, 
             aes(x = type, y = rank, group = term_group)) +
  # Lines with color aesthetic
  geom_line(aes(color = diff_category), linewidth = 1, alpha = 0.8) +
  # Points with color aesthetic
  geom_point(aes(color = diff_category), size = 3) +
  # Labels on the post side (left) - BLACK TEXT
  geom_text_repel(data = slope_data_colored %>% 
                    filter(type == "post"),
                  aes(label = term_group),
                  nudge_x = -0.3,
                  direction = "y",
                  hjust = 1,
                  segment.color = NA,
                  size = 3.5,
                  color = "black",  # Explicitly set text color to black
                  show.legend = FALSE,
                  max.overlaps = 20) +
  # Labels on the com side (right) - BLACK TEXT
  geom_text_repel(data = slope_data_colored %>% 
                    filter(type == "com"),
                  aes(label = term_group),
                  nudge_x = 0.3,
                  direction = "y",
                  hjust = 0,
                  segment.color = NA,
                  size = 3.5,
                  color = "black",  # Explicitly set text color to black
                  show.legend = FALSE,
                  max.overlaps = 20) +
  scale_color_manual(
    values = c(">4" = "#d73027", 
               "3-4" = "#fdae61", 
               "2-3" = "#4575b4", 
               "0-1" = "grey"),
    name = "Différence de rang"
  ) +
  labs(
    title = "Différences de rang dans les mots les plus utilisés",
    x = "",
    y = "Rang (1 = plus fréquent)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5, size = 14),
    axis.text = element_text(size = 12),
    axis.text.x = element_text(size = 13, face = "bold"),
    legend.position = "bottom",
    legend.text = element_text(size = 11),
    legend.title = element_text(size = 12, face = "bold"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank()
  ) +
  scale_y_reverse()

p4

Number of terms in common: 18/20.

9 Crossing the gender of post authors with that of commenters

As a reminder, this is the cross-tabulation of the gender of comment authors with the gender of the authors of the posts they reply to, over the entire database.

We compute it on several bases:

  • The full base, for comparison: all the data collected on AskMec and AskMeuf

  • The strict comment base: comments that contain a lexicon term

  • The extended comment base: comments that contain a lexicon term OR whose post has at least 3 comments containing a lexicon term.
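The extended base could be assembled along these lines. This is a sketch on toy data: the column names `post_id` and `has_term` are illustrative, while the real data keys comments to posts via `clé` and the lexicon flags:

```r
library(dplyr)

# One row per comment: its post and whether it contains a lexicon term
com_toy <- tibble::tibble(
  post_id  = c(1, 1, 1, 1, 2, 2),
  has_term = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Posts with at least 3 comments containing a lexicon term
posts_3plus <- com_toy %>%
  group_by(post_id) %>%
  summarise(n_hits = sum(has_term), .groups = "drop") %>%
  filter(n_hits >= 3)

# Extended base: the comment has a term OR its post passes the threshold
com_ext <- com_toy %>%
  filter(has_term | post_id %in% posts_3plus$post_id)
```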

cross_genre <- com %>%
  filter(!is.na(genre2) & !is.na(genre2_posteur)) %>% 
  count(genre2, genre2_posteur, subreddit) %>%
  group_by(subreddit) %>%                  # Calculate % within each subreddit
  mutate(percent = n / sum(n) * 100) %>%
  ungroup()

ggplot(cross_genre, aes(x = genre2, y = genre2_posteur, fill = percent)) +
  geom_tile() +
  geom_text(aes(label = sprintf("%.1f%%", percent)), color = "black", size = 4) +
  facet_grid(. ~ subreddit) +                                                # Facet by subreddit
  scale_fill_gradient(low = "white", high = "#FF7F50", name = "Pourcentage (%)") +
  labs(x = "Genre (Répondant.e)",
       y = "Genre (Auteur.ice du post)"
  ) +
  theme_minimal() +
  theme(  strip.text = element_text(face = "bold", size = 12),
          axis.text.x = element_text(angle = 45, hjust = 1),
          panel.grid = element_blank())

Below: the gender of the authors of comments containing a lexicon term, crossed with the gender of the authors of the posts those comments reply to (NAs are not shown, i.e. 268 comments out of 18,783).

cross_genre <- vss_com_plus %>%
  filter(!is.na(genre2) & !is.na(genre2_posteur)) %>% 
  count(genre2, genre2_posteur, subreddit) %>%
  group_by(subreddit) %>%                  # Calculate % within each subreddit
  mutate(percent = n / sum(n) * 100) %>%
  ungroup()

ggplot(cross_genre, aes(x = genre2, y = genre2_posteur, fill = percent)) +
  geom_tile() +
  geom_text(aes(label = sprintf("%.1f%%", percent)), color = "black", size = 4) +
  facet_grid(. ~ subreddit) +                                                # Facet by subreddit
  scale_fill_gradient(low = "white", high = "#FF7F50", name = "Pourcentage (%)") +
  labs(x = "Genre (Répondant.e)",
       y = "Genre (Auteur.ice du post)"
  ) +
  theme_minimal() +
  theme(  strip.text = element_text(face = "bold", size = 12),
          axis.text.x = element_text(angle = 45, hjust = 1),
          panel.grid = element_blank())

cross_genre <- com_post_vss %>%
  filter(!is.na(genre2) & !is.na(genre2_posteur)) %>% 
  count(genre2, genre2_posteur, subreddit) %>%
  group_by(subreddit) %>%                  # Calculate % within each subreddit
  mutate(percent = n / sum(n) * 100) %>%
  ungroup()

ggplot(cross_genre, aes(x = genre2, y = genre2_posteur, fill = percent)) +
  geom_tile() +
  geom_text(aes(label = sprintf("%.1f%%", percent)), color = "black", size = 4) +
  facet_grid(. ~ subreddit) +                                                # Facet by subreddit
  scale_fill_gradient(low = "white", high = "#FF7F50", name = "Pourcentage (%)") +
  labs(x = "Genre (Répondant.e)",
       y = "Genre (Auteur.ice du post)"
  ) +
  theme_minimal() +
  theme(  strip.text = element_text(face = "bold", size = 12),
          axis.text.x = element_text(angle = 45, hjust = 1),
          panel.grid = element_blank())

On comments that reply to a post which does NOT contain any lexicon term.

cross_genre <- vss_com_plus %>%
  filter(clé %in% post_plusplus$links) %>% 
  filter(!is.na(genre2) & !is.na(genre2_posteur)) %>% 
  count(genre2, genre2_posteur, subreddit) %>%
  group_by(subreddit) %>%                  # Calculate % within each subreddit
  mutate(percent = n / sum(n) * 100) %>%
  ungroup()

ggplot(cross_genre, aes(x = genre2, y = genre2_posteur, fill = percent)) +
  geom_tile() +
  geom_text(aes(label = sprintf("%.1f%%", percent)), color = "black", size = 4) +
  facet_grid(. ~ subreddit) +                                                # Facet by subreddit
  scale_fill_gradient(low = "white", high = "#FF7F50", name = "Pourcentage (%)") +
  labs(x = "Genre (Répondant.e)",
       y = "Genre (Auteur.ice du post)"
  ) +
  theme_minimal() +
  theme(  strip.text = element_text(face = "bold", size = 12),
          axis.text.x = element_text(angle = 45, hjust = 1),
          panel.grid = element_blank())

Next steps: a co-occurrence graph of the selected terms, along the lines of https://slcladal.netlify.app/coll.html#3_Visualizing_Collocations? Take the time to do a more qualitative reading of what we have selected? Look for the “Témoignage” posts?
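As a starting point for that co-occurrence graph, a document-level pairwise count could look like this. Toy data: in the real analysis the input would be one row per (post or comment, lexicon term) occurrence:

```r
library(dplyr)

# One row per (document, term) occurrence
doc_terms <- tibble::tibble(
  doc  = c(1, 1, 2, 2, 2, 3),
  term = c("peur", "violence", "peur", "colère", "violence", "peur")
)

# Self-join on the document id, keep each unordered pair once,
# then count in how many documents the pair co-occurs
cooc <- doc_terms %>%
  inner_join(doc_terms, by = "doc", relationship = "many-to-many") %>%
  filter(term.x < term.y) %>%
  count(term.x, term.y, name = "n_docs")
```

From `cooc`, a network could then be drawn (e.g. with igraph/ggraph, edge width mapped to `n_docs`), as in the linked tutorial.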